Language modeling for speech recognition of spoken Cantonese
Abstract
This paper addresses the problem of language modeling for large-vocabulary continuous speech recognition (LVCSR) of Cantonese as spoken in daily communication. As a spoken dialect, Cantonese is not used in written documents and published materials, so it is difficult to collect a sufficient amount of written Cantonese text for training statistical language models. We propose to solve this problem by translating standard Chinese text, which is much easier to find, into written Cantonese. A rule-based translation method is devised and implemented. Three language models are trained on different types of text and evaluated on an LVCSR task. Experimental results confirm that the translated text represents well the Cantonese spoken on formal occasions such as broadcast news. For colloquial Cantonese, adapting the language model with a limited amount of colloquial Cantonese text is a practically feasible solution that leads to reasonable speech recognition performance.
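To make the idea of rule-based translation concrete, the following is a minimal sketch of one possible approach: greedy longest-match substitution of standard Chinese words by written Cantonese equivalents. The rule table, the function name `translate`, and the matching strategy are assumptions made for illustration; the abstract does not specify the paper's actual rules or algorithm.

```python
# Hypothetical rule table: standard Chinese word -> written Cantonese word.
# These entries are common textbook correspondences chosen for illustration;
# they are NOT taken from the paper's rule set.
RULES = {
    "他們": "佢哋",  # they
    "沒有": "冇",    # not have
    "什麼": "乜嘢",  # what
    "他": "佢",      # he
    "是": "係",      # to be
    "不": "唔",      # not
    "的": "嘅",      # possessive particle
}


def translate(text, rules=RULES):
    """Rewrite text left to right, always preferring the longest matching rule."""
    max_len = max(len(key) for key in rules)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so "他們" wins over "他".
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in rules:
                out.append(rules[chunk])
                i += n
                break
        else:
            # No rule applies at this position: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


print(translate("他們是學生"))  # prints 佢哋係學生
```

Longest-match ordering matters here because shorter rules can be prefixes of longer ones. A practical system would likely also need word segmentation and context-dependent rules, which this sketch leaves out.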
Similar resources
Automatic Recognition of Cantonese-English Code-Mixing Speech
Code-mixing is a common phenomenon in bilingual societies. It refers to the intra-sentential switching of two different languages in a spoken utterance. This paper presents the first study on automatic recognition of Cantonese-English code-mixing speech, which is common in Hong Kong. This study starts with the design and compilation of code-mixing speech and text corpora. The problems of acoust...
Recent Advances in Cantonese Speech Recognition
This paper describes our recent work on automatic recognition of Cantonese. Cantonese is one of the major Chinese dialects, spoken by tens of millions of people in Southern China and Hong Kong. For isolated Cantonese syllables, a neural network based recognition algorithm has been successfully developed and the most up-to-date recognition results are presented. For continuous Cantonese speech, ...
Spoken language resources for Cantonese speech processing
This paper describes the development of CU Corpora, a series of large-scale speech corpora for Cantonese. Cantonese is the most commonly spoken Chinese dialect in Southern China and Hong Kong. CU Corpora are the first of their kind and intended to serve as an important infrastructure for the advancement of speech recognition and synthesis technologies for this widely used Chinese dialect. They ...
Design, Compilation and Processing of CUCall: A Set of Cantonese Spoken Language Corpora Collected Over Telephone Networks
The design and compilation of the CUCall telephone speech corpora are described in this paper. A speech database is an indispensable resource for research and development of state-of-the-art spoken language technology. Speech recognition systems rely greatly on a huge amount of well-designed and appropriately processed speech data for parameter training. On the other hand, as telephony appl...
Towards Highly Usable and Robust Spoken Language Technologies for Chinese
This paper gives an overview of our research on Chinese spoken language technologies during the past ten years. It covers fundamental acoustic-phonetic studies of spoken Cantonese, speech corpora development, automatic speech recognition and text-to-speech. Currently our focus is on making these technologies more usable for general users who are not speech experts, and more robust for real-worl...